[Spark] Make delta.dataSkippingStatsColumns more lenient for nested columns #2850
base: master
Conversation
@kamcheungting-db since you added this originally
@longvu-db since you're reviewing my other PR 😊. This would be a great quality of life improvement for complex nested schemas.
@Kimahriman Will take a look!
Commit: …csCollection.scala (Co-authored-by: Thang Long Vu <[email protected]>)
@@ -599,6 +599,27 @@ trait DataSkippingDeltaTestsBase extends DeltaExcludedBySparkVersionTestMixinShi
    deltaStatsColNamesOpt = Some("b.c")
  )

  testSkipping(
Should we consider a double struct test?
Added a double nested struct as well
  },
  "i": 10
}""".replace("\n", ""),
hits = Seq(
Could we have hit tests for elements inside the struct and the double struct as well?
Btw, click on re-request review so that I get notified.
Added more hit tests
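For reference, a rough sketch of what one of those hit tests could look like, inferred from the surrounding diff context (the exact testSkipping signature and the misses parameter are assumptions on my part):

```scala
// Sketch only, inferred from the diff context above; not the PR's exact test.
// Assumes testSkipping(name, data, hits, misses, deltaStatsColNamesOpt).
testSkipping(
  "data skipping stats columns - doubly nested struct",
  """{
    "b": { "c": { "d": 5 }, "e": [1, 2, 3] },
    "i": 10
  }""".replace("\n", ""),
  hits = Seq("b.c.d < 10"),   // could match: min(b.c.d) = 5 < 10
  misses = Seq("b.c.d > 10"), // skipped: max(b.c.d) = 5
  deltaStatsColNamesOpt = Some("b")
)
```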
My understanding is that when we are inside a struct, we gather stats for the valid data types inside the struct, but for the invalid ones we still don't, right? Would we want to test that we still don't skip on the invalid data types inside the struct?
Also, if we could check another skipping-eligible type besides integer, like timestamp or string, that would be great.
Technically null counts are collected, but they can't be used for skipping. We could add that to the test, but it seems kind of out of scope because you can't do min/max things with arrays/maps anyway, so it'd have to be something like an element-contains check.
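For what it's worth, one way to eyeball which stats actually get collected (a rough illustration; the table path is hypothetical, and stats live as a JSON string on each AddFile action in the transaction log):

```scala
// Rough illustration: inspect per-file stats recorded in the first
// transaction log entry of a hypothetical Delta table.
val logEntry = spark.read.json("/tmp/delta/nested_table/_delta_log/00000000000000000000.json")
logEntry
  .select("add.stats")
  .where("add IS NOT NULL")
  .show(truncate = false)
// For a struct mixing an INT leaf and an ARRAY leaf, minValues/maxValues
// contain only the INT leaf, while nullCount covers both leaves.
```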
I think there are other non-skipping-eligible data types besides list types.
I'm not sure how to specify a Binary type in the inferred JSON. Is that the only non-complex type?
@Kimahriman Just out of curiosity, why are you making this change? Do you have some complex nested schemas that you want to skip data on?
Yes, we have structs with mixed arrays and primitive types. Currently I have to recursively count the fields to set the number of indexed columns to the right value.
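To illustrate the pain point (a hypothetical schema, not taken from the PR):

```scala
// Hypothetical example: a struct mixing skipping-eligible primitives with
// an ineligible array type.
spark.sql("""
  CREATE TABLE events (
    payload STRUCT<id: BIGINT, ts: TIMESTAMP, tags: ARRAY<STRING>>,
    extra STRING
  ) USING delta
""")

// The workaround today: recursively count leaf fields so that the first N
// indexed columns cover everything eligible inside the struct.
spark.sql(
  "ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4')")
```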
}
case _ if insideStruct => columnPaths.append(name)
When we are inside the struct, we keep appending columns regardless of the data type, right?
So is the description of "@param columnPaths" no longer correct?
This seems outdated as well:
delta/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala (line 481 in da5a5d2):
 * 2. Delta statistics column must exist in delta table's schema.
Technically it is "valid"; it will actually collect null counts on all fields regardless of whether min/max is supported.
@@ -458,15 +458,20 @@ object StatisticsCollection extends DeltaCommand {
 * @param name The name of the data skipping column for validating data type.
 * @param dataType The data type of the data skipping column.
 * @param columnPaths The column paths of all valid fields.
 * @param insideStruct Whether the datatype is inside a struct already, in which case we don't
Instead of saying "we don't clear", we can make it clearer what we mean by that, I think.
Changed, not sure if it's any more clear.
@@ -608,7 +590,8 @@ class StatsCollectionSuite

    Seq(
      "BIGINT", "DATE", "DECIMAL(3, 2)", "DOUBLE", "FLOAT", "INT", "SMALLINT", "STRING",
-     "TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT"
+     "TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT", "STRUCT<c3: BIGINT>",
Looks like there are relevant tests in DataSkippingDeltaTests as well:
delta/spark/src/test/scala/org/apache/spark/sql/delta/stats/DataSkippingDeltaTests.scala (line 442 in da5a5d2):
  "b.c.d < 0",
Those tests are for specifying the number of indexed columns, not columns by name
Hi Adam, I intentionally throw an error when the column doesn't support delta stats.
If we want to swallow that error, there should be a way to tell the user that the column is unsupported.
Happy to log a warning, but it's not actually unsupported. Null counts are still collected for all columns. If users are confused why their arrays and maps aren't collecting mins and maxes, they have bigger problems. This behavior matches how the number of indexed columns works.
Warning sounds like a good indicator.
Added a warning log; let me know if you have thoughts on the exact verbiage, I just copied the error text.
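For context, a minimal self-contained sketch of the lenient validation being discussed (simplified stand-ins: the eligible-type list is abbreviated, and println stands in for the real logger; this is not the PR's exact code):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types._

// Simplified sketch: inside a struct, an ineligible leaf type is skipped
// with a warning; a directly named ineligible column still throws.
def validateStatsColumn(
    name: String,
    dataType: DataType,
    columnPaths: ArrayBuffer[String],
    insideStruct: Boolean = false): Unit = dataType match {
  case s: StructType =>
    // Recurse into every field, remembering that we are now inside a struct.
    s.fields.foreach { f =>
      validateStatsColumn(s"$name.${f.name}", f.dataType, columnPaths, insideStruct = true)
    }
  case IntegerType | LongType | StringType | DateType | TimestampType =>
    // Stand-in for SkippingEligibleDataType: record the eligible leaf path.
    columnPaths.append(name)
  case d if insideStruct =>
    // Lenient case: warn instead of failing. Null counts are still
    // collected for these fields; they just get no min/max stats.
    println(s"WARN: data skipping is not supported for column $name of type $d")
  case d =>
    // Strict case: the user explicitly named an ineligible column.
    throw new IllegalArgumentException(
      s"Data skipping is not supported for column $name of type $d")
}
```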
}
case SkippingEligibleDataType(_) => columnPaths.append(name)
case d if insideStruct =>
Should we also make the non-struct fields have the same behavior?
Up to you. The idea was: if you directly specify an ineligible type, throw an exception because you did something wrong. If you specify a struct, just work with the subfields that are supported.
Then I am good with it.
Thank you.
Which Delta project/connector is this regarding?
Spark
Description
Resolves #2822
Make delta.dataSkippingStatsColumns more lenient for nested columns by not throwing an exception if a nested column doesn't support gathering stats. This more closely matches the behavior of dataSkippingNumIndexedCols, which allows unsupported types in those columns (and seems to still gather null counts for those unsupported types). This also allows more use cases where you might have a wide variety of types inside a top-level struct and you simply want to gather stats on whatever columns inside that struct you can.

I kept the duplicate column checking in place to minimize changes, but I'm not sure how necessary that really is besides letting users know they are doing something dumb.
How was this patch tested?
A couple of tests that specifically checked for the old exception-throwing behavior were removed, and a new test was added to verify the new behavior works.
Does this PR introduce any user-facing changes?
Yes, specifying a struct with unsupported stats-gathering types in delta.dataSkippingStatsColumns is now allowed instead of throwing an exception.
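For example (hypothetical table and column names):

```scala
// Before this PR, naming a struct that contains an ineligible nested type
// (e.g. an array) threw an exception; now the eligible subfields get stats
// and the ineligible ones are skipped with a warning.
spark.sql(
  "ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'payload')")
```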